10. Analyzing Dataset for High Cardinality
High Cardinality
Cardinality refers to the number of unique values that a feature has. It is relevant to EHR datasets because code sets such as diagnosis codes can contain on the order of tens of thousands of unique codes. Cardinality applies only to categorical features. High cardinality is a problem because it can increase dimensionality, which makes training models much more difficult and time-consuming.
How do we define a field with high cardinality?
- Determine if it is a categorical feature.
- Determine if it has a high number of unique values. This can be a bit subjective, but we can probably agree that a field with 2 unique values does not have high cardinality, whereas a field like diagnosis codes, which might have tens of thousands of unique values, does.
- Use the nunique() method to return the number of unique values for each of the categorical features identified above.
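The steps above can be sketched in pandas. This is a minimal illustration, not the lesson's own code: the DataFrame, its column names, its values, and the threshold of 3 are all hypothetical and chosen only to make the example small. On a real EHR dataset the cutoff for "high" remains a judgment call.

```python
import pandas as pd

# Hypothetical toy EHR-style data; column names and values are made up for illustration.
df = pd.DataFrame({
    "principal_diagnosis_code": ["E11.9", "I10", "J45.909", "E11.9", "N18.3", "I10"],
    "zip_code": ["94103", "10001", "60614", "73301", "94103", "30301"],
    "gender": ["F", "M", "F", "F", "M", "M"],
    "age": [54, 61, 33, 47, 70, 58],
})

# Step 1: decide which fields are categorical.
# Listed by hand here, because the stored dtype alone can mislead
# (a zip code may be stored as an integer but is still categorical).
categorical_cols = ["principal_diagnosis_code", "zip_code", "gender"]

# Step 2: count the unique values in each categorical field.
cardinality = df[categorical_cols].nunique().sort_values(ascending=False)
print(cardinality)

# Step 3: flag fields whose unique-value count exceeds an illustrative threshold.
THRESHOLD = 3  # assumption: arbitrary cutoff for this toy data only
high_cardinality = cardinality[cardinality > THRESHOLD].index.tolist()
print(high_cardinality)
```

Listing the categorical columns explicitly, rather than inferring them from dtypes, keeps the first step a deliberate decision, which matches the order of the checks described above.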
High Cardinality Quiz
SOLUTION:
- Principal diagnosis code
- Zip Code
Code
If you need the code, it is available at https://github.com/udacity.